library(xcms)
library(tidyverse)
library(plotly)
A fast recap about what we know about metabolites in LC-MS …
So every metabolite …
So …
The process of finding chromatographic peaks in the data is called peak picking. It can be done in many different ways, and actually every software will do it slightly differently. The first step in the analysis of our dataset will be to pick the peaks in the full set of samples. Here I’ll show you the process on one file.
There are some important point to remember.
Working on a representative file will help you in fine tuning and benchmarking your parameters
Let’s start reeding in a raw file
raw_one <- readMSData(
files = "../data/wines/x016_X_QC_X_4_NEG_DDA.mzML",
msLevel. = 1, ## we read only MS1
mode = "onDisk") ## with this parameter the data are not loaded into RAM
I’ll now show to you how to perform peak picking with two algoritms availabe in xcms. In the 99% of the case you will use only one of them (CentWave), but it is nice - once in the life - to really put your hands in the machine
The “older” and most sounding way of finding peaks implemented in
xcms is the matched filter algorithm.
A full description of the parameters of the algorithm can be found in
the xcms
manual, here we focus on:
In xcms the parameters of the algorithm are stored into a specific object:
mf <- MatchedFilterParam(binSize = 0.1,
fwhm = 6,
snthresh = 5)
mf
## Object of class: MatchedFilterParam
## Parameters:
## - binSize: [1] 0.1
## - impute: [1] "none"
## - baseValue: numeric(0)
## - distance: numeric(0)
## - fwhm: [1] 6
## - sigma: [1] 2.547987
## - max: [1] 5
## - snthresh: [1] 5
## - steps: [1] 2
## - mzdiff: [1] 0.6
## - index: [1] FALSE
Now I can use the previous parameters to find the peaks in one sample:
## this function is used to find the chromatographic peaks with the previous parameters
raw_one_mf_picked <- findChromPeaks(raw_one, param = mf)
raw_one_mf_picked
## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.24 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 546
## MSn retention times: 0:01 - 11:60 minutes
## - - - Processing information - - -
## Data loaded [Wed Nov 16 11:00:42 2022]
## Filter: select MS level(s) 1. [Wed Nov 16 11:00:42 2022]
## MSnbase version: 2.22.0
## - - - Meta data - - -
## phenoData
## rowNames: x016_X_QC_X_4_NEG_DDA.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## x016_X_QC_X_4_NEG_DDA.mzML
## protocolData: none
## featureData
## featureNames: F1.S0001 F1.S0004 ... F1.S1636 (546 total)
## fvarLabels: fileIdx spIdx ... spectrum (35 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: matchedFilter
## 638 peaks identified in 1 samples.
## On average 638 chromatographic peaks per sample.
Ok, the software did his job. As you can see it was able to fine 638
peaks in this sample. As you can see the raw_one_mf_picked
still holds peaks and raw data. The peak table can be extracted with a
specific method
mf_peaks <- chromPeaks(raw_one_mf_picked)
dim(mf_peaks)
## [1] 638 13
head(mf_peaks, 5)
## mz mzmin mzmax rt rtmin rtmax into intf
## CP001 50.01399 50.01241 50.04396 337.7074 334.5620 340.1007 358728.6 387501.5
## CP002 51.67812 51.67799 51.67819 306.2680 303.3541 311.2829 529555.9 428820.5
## CP003 53.00739 53.00735 53.00742 327.9940 326.5501 330.6442 1460325.7 1670480.7
## CP004 53.23135 53.23127 53.23137 303.3541 301.9032 306.2680 419943.3 519133.2
## CP005 54.56881 54.56869 54.56895 276.0245 273.0395 277.6028 183758.4 203009.4
## maxo maxf i sn sample
## CP001 99313.48 102323.41 1 6.905211 1
## CP002 115270.37 94863.14 1 7.721503 1
## CP003 478205.50 461500.98 1 16.710695 1
## CP004 166208.92 141195.95 1 6.904875 1
## CP005 37604.99 45538.38 1 5.020740 1
Let’s walk to the most relevant columns:
Let’s now give a look to the position of the peaks in the mz/rt plane. The size of the point will be proportional to the intensity
mf_peaks %>%
as.data.frame() %>%
mutate(into = sqrt(into)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, alpha = into, col = into)) +
scale_color_viridis_c() +
theme_bw()
If you go back to the previous demo, we were focussing on a specific area of the raw signal which was particularly promising
rt <- rtime(raw_one)
mz <- mz(raw_one)
I <- intensity(raw_one)
sub_peaks_mf <- mf_peaks %>%
as.data.frame() %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300)
ggplotly(tibble(rt = rt, mz = mz, I = I) %>%
unnest(c("mz","I")) %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) +
geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) +
scale_color_viridis_c() +
theme_light())
What we see:
This view gives an idea of the boundaries of the peaks
tibble(rt = rt, mz = mz, I = I) %>%
unnest(c("mz","I")) %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) +
geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) +
geom_segment(data = sub_peaks_mf, aes(x = rtmin, xend = rtmax, y = mz, yend = mz), col = "red") +
scale_color_viridis_c() +
theme_light()
So some lines are superimposed, some others not. Tricky business!
Peak picking can also be performed with another algorithm:
CentWave. The algorithm applied here is more clever and
better suited for high resolution data. As we have seen in the lecture,
it relies on the fact that the mass trace get stable in presence of a
strong ionic signal
Also here many parameters (and others are not mentioned). I highlight here some of them:
cwp <- CentWaveParam(peakwidth = c(5, 30), ## expected range of chromatographic peak width
ppm = 5, ## tolerance to identify ROIs in the mz/rt plane
prefilter = c(5, 50000),## number of consecutive scans showing a signal higher than 50000
noise = 5000) ## minimum signal to be considered
cwp
## Object of class: CentWaveParam
## Parameters:
## - ppm: [1] 5
## - peakwidth: [1] 5 30
## - snthresh: [1] 10
## - prefilter: [1] 5 50000
## - mzCenterFun: [1] "wMean"
## - integrate: [1] 1
## - mzdiff: [1] -0.001
## - fitgauss: [1] FALSE
## - noise: [1] 5000
## - verboseColumns: [1] FALSE
## - roiList: list()
## - firstBaselineCheck: [1] TRUE
## - roiScales: numeric(0)
## - extendLengthMSW: [1] FALSE
If we run the peak picking with this new algorithm…
raw_one_cw_picked <- findChromPeaks(raw_one, param = cwp)
raw_one_cw_picked
## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.24 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 546
## MSn retention times: 0:01 - 11:60 minutes
## - - - Processing information - - -
## Data loaded [Wed Nov 16 11:00:42 2022]
## Filter: select MS level(s) 1. [Wed Nov 16 11:00:42 2022]
## MSnbase version: 2.22.0
## - - - Meta data - - -
## phenoData
## rowNames: x016_X_QC_X_4_NEG_DDA.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## x016_X_QC_X_4_NEG_DDA.mzML
## protocolData: none
## featureData
## featureNames: F1.S0001 F1.S0004 ... F1.S1636 (546 total)
## fvarLabels: fileIdx spIdx ... spectrum (35 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: centWave
## 567 peaks identified in 1 samples.
## On average 567 chromatographic peaks per sample.
Ok here we see that the number of spotted peaks is different. If you think to it it is not strange: different method different results.
The first natural question you can ask is: where are you getting the parameters? Well, as I already told you at some point the “expert knowledge” enters the process. Reasonable guess for them is always coming from a good knowledge of the analytical pipeline. Optimal value could be optimized in an automatic way (like IPO does), but reasonable guesses are fundamental to restrict the quest for the optimal solution in a multidimensional space. So either you understand the analytics, or you should go and speak with you “analytical” colleague.
The second question. Are we getting comparable results?
cw_peaks <- chromPeaks(raw_one_cw_picked)
dim(cw_peaks)
## [1] 567 11
head(cw_peaks, 5)
## mz mzmin mzmax rt rtmin rtmax into intb
## CP001 218.9821 218.9820 218.9822 3.1473 0.6727 8.1741 547614.9 547606.2
## CP002 218.9822 218.9820 218.9824 35.8160 29.5353 43.3485 637327.6 637313.8
## CP003 218.9819 218.9817 218.9823 11.9442 8.1741 16.9582 596261.7 596251.7
## CP004 218.9821 218.9818 218.9823 20.7363 16.9582 24.4925 462471.5 462462.7
## CP005 174.9562 174.9559 174.9564 3.1473 0.6727 11.9442 690276.5 690264.0
## maxo sn sample
## CP001 79109.52 79109 1
## CP002 65368.86 65368 1
## CP003 71700.71 71700 1
## CP004 60961.76 60961 1
## CP005 67842.15 67841 1
ggplotly(mf_peaks %>%
as.data.frame() %>%
ggplot() +
geom_point(aes(x = rt, y = mz), col = "red", pch = 3, alpha = 0.5) +
geom_point(cw_peaks %>% as_tibble(), mapping = aes(x = rt, y = mz), col = "blue", pch = 1, alpha = 0.5) +
theme_bw())
Different, isn’t it? In some cases the two algorithms are coherent, in others the results are markedly different. Centwave it is also giving this strange horizontal line at large rt
Let’s give a look to our subregion
sub_peaks_cw <- cw_peaks %>%
as.data.frame() %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300)
ggplotly(tibble(rt = rt, mz = mz, I = I) %>%
unnest(c("mz","I")) %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) +
geom_point(data = sub_peaks_cw, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) +
scale_color_viridis_c() +
theme_light())
In real life situations, when you are happy with your peak picking
parameters you run the process on all your samples. xcms is
designed for that, the only thing you have to do is to load more than on
e file.
In view of the step in which the peak list will be merged together it is important also to add to the object containing the raw data any type of meta information which is associated to the samples. Typical meta information includes:
in xcms all these infos are stored in an
AnnotatedDataFrame object
Typically all these info should be included in the filename, or in a
csv which links the filename with the metadata. In the dataset we are
using everything is in the file name, so let’s process them. In the
following we will assume that the data are stored as mzML
inside the data subfolder.
library(tools) ## just to use file_path_sans_ext
phenodata <- tibble(filepath = list.files("../data/wines/", pattern = ".mzML", full.names = TRUE)) %>%
mutate(fname = file_path_sans_ext(basename(filepath))) %>%
separate(fname, into = c("inj_ord","color","variety","bottle","rep","polarity","mode"),
sep = "_", remove = FALSE) %>%
as(.,"AnnotatedDataFrame")
## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 3 rows [15, 17,
## 19].
## to see the content
pData(phenodata)
## # A tibble: 32 × 9
## filepath fname inj_ord color variety bottle rep polar…¹ mode
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ../data/wines//x001_X… x001… x001 X blank X 1 NEG DDA
## 2 ../data/wines//x002_X… x002… x002 X QC X 1 NEG DDA
## 3 ../data/wines//x003_X… x003… x003 X QC X 2 NEG DDA
## 4 ../data/wines//x004_r… x004… x004 red sangio B 1 NEG DDA
## 5 ../data/wines//x005_r… x005… x005 red ptnoir A 1 NEG DDA
## 6 ../data/wines//x006_r… x006… x006 red merlot A 1 NEG DDA
## 7 ../data/wines//x007_w… x007… x007 wht vermen A 1 NEG DDA
## 8 ../data/wines//x008_r… x008… x008 red cannon A 1 NEG DDA
## 9 ../data/wines//x009_w… x009… x009 wht chardo B 1 NEG DDA
## 10 ../data/wines//x010_w… x010… x010 wht chardo A 1 NEG DDA
## # … with 22 more rows, and abbreviated variable name ¹polarity
Here we read the data in
raw_data <- raw_data <- readMSData(
files = phenodata$filepath,
msLevel. = 1,
pdata = phenodata, ## this is the structure of xcms holding phenotypic data
mode = "onDisk") ## with this parameter the data are not loaded into RAM
And perform the peak picking with centwave
register(SerialParam())
raw_data <- findChromPeaks(raw_data, param = cwp)
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 43 regions of interest ... OK: 20 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 718 regions of interest ... OK: 614 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 710 regions of interest ... OK: 634 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 734 regions of interest ... OK: 597 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 677 regions of interest ... OK: 568 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 810 regions of interest ... OK: 670 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 563 regions of interest ... OK: 524 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 756 regions of interest ... OK: 659 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 521 regions of interest ... OK: 497 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 733 regions of interest ... OK: 684 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 668 regions of interest ... OK: 488 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 742 regions of interest ... OK: 626 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 665 regions of interest ... OK: 593 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 44 regions of interest ... OK: 21 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 138 regions of interest ... OK: 104 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 712 regions of interest ... OK: 526 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 144 regions of interest ... OK: 103 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 707 regions of interest ... OK: 567 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 155 regions of interest ... OK: 127 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 538 regions of interest ... OK: 479 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 836 regions of interest ... OK: 609 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 509 regions of interest ... OK: 475 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 625 regions of interest ... OK: 666 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 741 regions of interest ... OK: 646 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 802 regions of interest ... OK: 598 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 715 regions of interest ... OK: 652 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 706 regions of interest ... OK: 533 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 707 regions of interest ... OK: 516 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 669 regions of interest ... OK: 639 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 39 regions of interest ... OK: 8 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 721 regions of interest ... OK: 677 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 743 regions of interest ... OK: 612 found.
To spare time we save the picked object …
save(raw_data, file = "wines.RData")
Let’s give a look to the content of the full dataset
raw_data
## MSn experiment data ("XCMSnExp")
## Object size in memory: 4.84 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 17243
## MSn retention times: 0:01 - 12:00 minutes
## - - - Processing information - - -
## Data loaded [Wed Nov 16 11:01:13 2022]
## Filter: select MS level(s) 1. [Wed Nov 16 11:01:13 2022]
## MSnbase version: 2.22.0
## - - - Meta data - - -
## phenoData
## rowNames: 1 2 ... 3 (32 total)
## varLabels: filepath fname ... mode (9 total)
## varMetadata: labelDescription
## Loaded from:
## [1] x001_X_blank_X_1_NEG_DDA.mzML... [32] x029_X_QC_X_6_NEG_DDA.mzML
## Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
## featureNames: F01.S0001 F01.S0004 ... F32.S1627 (17243 total)
## fvarLabels: fileIdx spIdx ... spectrum (35 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
## method: centWave
## 15732 peaks identified in 32 samples.
## On average 492 chromatographic peaks per sample.
Here we are dealing with 29 files with different characteristics. After the peak picking optimization step we are confident that the parameters we have been choosing should give reasonably good results. Before going on, it is however useful to make some additional checks on the data quality.
The first thing to do is to monitor the TIC of the 29 experiments to spot potential drops of the signal during the analysis
tics <- chromatogram(raw_data, aggregationFun = "sum",
include = "none") ## this argument is required to avoid
## including all the chromatograms of the picked peaks
We now plot them, with colors matching the sample class
mypalette <- c("steelblue", "coral", "darkgreen")
names(mypalette) <- c("red","X","wht")
plot(tics, col = mypalette[raw_data$color])
legend("topright", legend = c("red","Blank & QC","white"), col = mypalette, lty = 1)
The plot shows already nice things:
The trend with the injection can be visualized as follows
tc <- split(tic(raw_data), f = fromFile(raw_data))
full_tics <- sapply(tc, sum)
pData(raw_data) %>%
tibble() %>%
add_column(full_tics = full_tics) %>%
ggplot() +
geom_point(aes(x = inj_ord, y = full_tics, col = color), size = 2) +
theme_bw() +
theme(aspect.ratio = 0.3)
So no clear trend is visible in the data. X here represents QC and blanks.
Another type ov visualization that I find useful with reasonably small datasets is the following
chromPeaks(raw_data) %>%
as_tibble() %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(maxo)), size = 1, alpha = 0.5) +
scale_color_viridis_c() +
facet_wrap(~factor(sample)) +
theme_bw() +
theme(aspect.ratio = 0.5)
Here we see the peak maps of the different files. It is clear that each map is expected to be slightly different, but rally outlying samples will show up clearly. Look to the blanks, for example, or to the horizontal lines which are present in many files
Something for you now:
filterFile function to get rid of the
blanks!)? Are they the same?
Comments
centWaveis non to be more reliable results